UCT/ROCM/COPY: Use faster memcpy for device to host copies #4532

Merged (1 commit) on Dec 4, 2019


souravzzz
Member

@souravzzz souravzzz commented Dec 2, 2019

What

Use a faster memcpy for device-to-host copies.

Why?

The native memcpy is slow for device-to-host copies.

How

Implement memcpy using non-temporal loads (MOVNTDQA).
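
For readers unfamiliar with the technique, here is a minimal sketch of a copy loop built on non-temporal loads. It is not the code added by this PR: the names `nt_load_memcpy` and `nt_load_copy_sse41` are hypothetical, and the sketch assumes GCC or Clang on x86_64 (it relies on the `target` function attribute and `__builtin_cpu_supports`). MOVNTDQA requires a 16-byte aligned source, so the unaligned head is copied first. The non-temporal hint only takes effect on write-combining memory; on ordinary write-back memory the instruction behaves like a regular aligned load, so any benefit depends on how the device buffer is mapped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h> /* SSE2: _mm_storeu_si128 */
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

/* Hypothetical helper, not taken from the PR. */
__attribute__((target("sse4.1")))
static void nt_load_copy_sse41(void *dst, const void *src, size_t len)
{
    uint8_t *d       = dst;
    const uint8_t *s = src;

    /* MOVNTDQA needs a 16-byte aligned source: copy the unaligned head. */
    size_t head = (16 - ((uintptr_t)s & 15)) & 15;
    if (head > len) {
        head = len;
    }
    memcpy(d, s, head);
    d   += head;
    s   += head;
    len -= head;

    /* Main loop: one non-temporal 16-byte load, one unaligned store. */
    while (len >= 16) {
        __m128i v = _mm_stream_load_si128((__m128i *)(uintptr_t)s);
        _mm_storeu_si128((__m128i *)d, v);
        d   += 16;
        s   += 16;
        len -= 16;
    }

    /* Copy the remaining tail (fewer than 16 bytes). */
    memcpy(d, s, len);
}
#endif

/* Hypothetical entry point with the same contract as memcpy(). */
void *nt_load_memcpy(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__)
    if (__builtin_cpu_supports("sse4.1")) {
        nt_load_copy_sse41(dst, src, len);
        return dst;
    }
#endif
    /* Fallback for other architectures or CPUs without SSE4.1. */
    return memcpy(dst, src, len);
}
```

A caller would use `nt_load_memcpy(host_buf, device_mapped_ptr, len)` in place of `memcpy` on the device-to-host path only. A production version would likely unroll the loop to move a full cache line per iteration.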

@swx-jenkins3
Collaborator

Can one of the admins verify this patch?

@shamisp
Contributor

shamisp commented Dec 2, 2019

ok to test

src/uct/rocm/Makefile.am (outdated, resolved)
src/uct/rocm/copy/rocm_copy_impl.h (outdated, resolved)
@mellanox-github
Contributor

Mellanox CI: FAILED on 3 of 25 workers

Note: the logs will be deleted after 10-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@souravzzz force-pushed the topic/sourav/rocm-memcpy-nt branch 2 times, most recently from ce7608e to 2bf407f, on December 3, 2019 17:16
@shamisp requested a review from yosefe on December 3, 2019 18:59
@shamisp
Contributor

shamisp commented Dec 3, 2019

@souravzzz Do you happen to have performance numbers that support this optimization?

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers

Note: the logs will be deleted after 10-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@souravzzz
Member Author

Hi @shamisp, here are some intra-node D2D numbers with the rocm_copy transport that show the improvement from this change.

#bytes   old (us)   new (us)
0           0.33       0.34
1           1.77       1.63
2           2.65       1.63
4           2.66       1.63
8           2.65       1.63
16          2.66       1.63
32          4.71       1.63
64          8.57       2.18
128        16.35       1.91
256        15.50       1.79
512        26.73       2.12
1K         49.81       3.05
2K         56.25       5.17
4K        111.84      10.26

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers

Note: the logs will be deleted after 10-Dec-2019


@shamisp
Contributor

shamisp commented Dec 3, 2019

Looks impressive. What about D2H? Thanks.

@souravzzz
Member Author

@shamisp We see good improvement for D2H transfers as well.

#bytes   old (us)   new (us)
0           0.35       0.33
1           0.86       0.84
2           1.39       0.84
4           1.38       0.84
8           1.37       0.84
16          1.39       0.84
32          2.44       0.86
64          4.44       0.86
128         8.51       0.87
256         8.08       0.86
512        13.69       1.20
1K         25.59       1.65
2K         31.75       2.79
4K         57.62       4.87
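
To make the two tables easier to compare, the per-size speedup (old latency divided by new latency) can be computed with a small sketch like the one below. The struct and function names (`lat_row`, `speedup`, `best_row`) are hypothetical, and the numbers are copied from the D2D and D2H tables in this thread.

```c
#include <stddef.h>

/* One table row: message size in bytes, old and new latency. */
struct lat_row {
    size_t bytes;
    double old_lat;
    double new_lat;
};

/* D2D numbers from the earlier comment in this thread. */
static const struct lat_row d2d[] = {
    {0, 0.33, 0.34},      {1, 1.77, 1.63},     {2, 2.65, 1.63},
    {4, 2.66, 1.63},      {8, 2.65, 1.63},     {16, 2.66, 1.63},
    {32, 4.71, 1.63},     {64, 8.57, 2.18},    {128, 16.35, 1.91},
    {256, 15.50, 1.79},   {512, 26.73, 2.12},  {1024, 49.81, 3.05},
    {2048, 56.25, 5.17},  {4096, 111.84, 10.26},
};

/* D2H numbers from the table above. */
static const struct lat_row d2h[] = {
    {0, 0.35, 0.33},      {1, 0.86, 0.84},     {2, 1.39, 0.84},
    {4, 1.38, 0.84},      {8, 1.37, 0.84},     {16, 1.39, 0.84},
    {32, 2.44, 0.86},     {64, 4.44, 0.86},    {128, 8.51, 0.87},
    {256, 8.08, 0.86},    {512, 13.69, 1.20},  {1024, 25.59, 1.65},
    {2048, 31.75, 2.79},  {4096, 57.62, 4.87},
};

/* Speedup of the new copy path over the old one for a single row. */
double speedup(const struct lat_row *row)
{
    return row->old_lat / row->new_lat;
}

/* Index of the row with the largest speedup; n must be at least 1. */
size_t best_row(const struct lat_row *table, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (speedup(&table[i]) > speedup(&table[best])) {
            best = i;
        }
    }
    return best;
}
```

With these numbers the speedup at 4K is roughly 10.9x for D2D and 11.8x for D2H, and both tables peak at the 1K row (about 16.3x and 15.5x respectively).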

@shamisp
Contributor

shamisp commented Dec 3, 2019

Looks good to me. @yosefe?

src/ucs/arch/aarch64/cpu.h (outdated, resolved)
src/ucs/arch/ppc64/cpu.h (outdated, resolved)
src/ucs/arch/x86_64/cpu.c (outdated, resolved)
src/ucs/arch/x86_64/cpu.c (outdated, resolved)
src/ucs/arch/x86_64/cpu.c (outdated, resolved)
src/uct/rocm/copy/rocm_copy_ep.c (outdated, resolved)
src/uct/rocm/copy/rocm_copy_ep.c (outdated, resolved)
@souravzzz
Member Author

Thanks for the feedback @yosefe. I have incorporated the suggested changes.

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers

Note: the logs will be deleted after 11-Dec-2019


@shamisp merged commit 7923499 into openucx:master on Dec 4, 2019
@souravzzz deleted the topic/sourav/rocm-memcpy-nt branch on December 4, 2019 22:25
@Akshay-Venkatesh
Contributor

@souravzzz Curious whether this test has the GPU touch the transferred buffers before each transfer.

@souravzzz
Member Author

@Akshay-Venkatesh No, these are results from the standard osu_latency benchmark.
